mtmd: be able to use alternative types for the K*Q multiplication #1567
Merged
Conversation
... and congrats on 1000 PRs closed! I guess that these DO matter and prove your journey right, not the Git stars! 🙏🥃
I thought I should give some attention to the multi-modality stuff. The initial idea was to enable flash attention (FA), but that turned out to be too big a change, as multi-modal models like to use unusual attention head sizes. While looking into this I noticed that a very large fraction of the image encoding time is spent in the K*Q matrix multiplication, so I decided to see if that could be made somewhat faster.

When not using FA, the K*Q matrix multiplication is done using 32-bit floats. An obvious thing to try is down-casting to f16/bf16, or perhaps even to Q8_0, to see whether that brings a performance benefit. Hence, this PR adds the ability to define the type used for the K*Q matrix multiplication via a command-line argument.

Somewhat surprisingly, I only see a performance improvement when running CPU-only on a Zen4 CPU (Ryzen-7950X) and using --mtmd-kq-type bf16. In that case, for a 1 MiB image, which generates 4015 image tokens, encoding time is reduced from 76 seconds to 65 seconds. (I thought that was much too long, so I tested the same image with today's llama.cpp; it needed ~300 seconds to encode the same image on the same CPU.)

I also played with converting to Q8_0. That seems to work just fine in terms of the generated response, but does not give a performance benefit. I guess part of the issue is that the Qwen3 vision encoder has a head size of 72, so to use Q8_0 one must pad K and Q to a row size of 96, which a) takes time and b) makes the matrix multiplication 78% larger.